72 research outputs found

DETECTING GENETIC ENGINEERING WITH A KNOWLEDGE-RICH DNA SEQUENCE CLASSIFIER

Author: Ge Yuchen
Publication venue: 'The Busan Gyeongnam Mathematical Society'
Publication date: 06/01/2021

Detecting evidence of genetic engineering in the wild is a problem of growing importance for biosecurity, provenance, and intellectual property rights. This thesis describes a computational system designed to detect engineering from DNA sequencing of biological samples and presents its performance on fully blinded test data. The pipeline builds on existing computational resources for metagenomics, including methods that use the full set of reference genomes deposited in GenBank. Starting from raw reads generated by short-read sequencers, the dominant host species are identified by k-mer analysis. Next, all the sequencing reads are mapped to the imputed host strain; reads that do not map are retained as suspicious. Suspicious reads are assembled de novo into suspicious contigs, which are then aligned against the NCBI non-redundant nucleotide database to annotate the engineered sequence and to identify whether the engineering is in a plasmid or is integrated into the host genome. Our initial system, applied to blinded samples, provides excellent identification of foreign gene content, the changes most likely to be functional. We have less ability to detect functional structural variants, small indels, and SNPs that are produced by genetic engineering but are more difficult to distinguish from natural variation. Future work will focus on improved methods for detecting synonymous recoding (used to introduce watermarks and for compatibility with synthesis and assembly methods), for using long-read sequence data, and for distinguishing engineered sequence from natural variation.
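The first two steps of the abstract above (host identification by k-mer analysis, then keeping reads that do not match the host) can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: the function names, the `min_fraction` threshold, and the use of exact k-mer sharing in place of real read mapping are all hypothetical and are not taken from the thesis.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every overlapping k-mer of a DNA string (upper-cased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(references, k):
    """Map each reference name to the set of k-mers it contains."""
    return {name: set(kmers(seq, k)) for name, seq in references.items()}

def classify_reads(reads, index, k, min_fraction=0.5):
    """Pick the dominant host by k-mer votes, then flag poorly matching reads.

    Assumption: sharing k-mers with the host stands in for read mapping,
    which a real pipeline would do with an aligner.
    """
    votes = Counter()
    for read in reads:
        for km in kmers(read, k):
            for name, ref_kmers in index.items():
                if km in ref_kmers:
                    votes[name] += 1
    if not votes:
        # No read shares a k-mer with any reference: everything is suspicious.
        return None, list(reads)
    host = votes.most_common(1)[0][0]
    suspicious = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        hits = sum(km in index[host] for km in read_kmers)
        if not read_kmers or hits / len(read_kmers) < min_fraction:
            suspicious.append(read)
    return host, suspicious
```

In this sketch the suspicious reads would be the input to the assembly and database-alignment steps, which are not modeled here.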

JScholarship

Load carrying capability of regional electricity-heat energy systems: Definitions, characteristics, and optimal value evaluation

Author: Cao Yuchen
Ge Shaoyun
Gu Chenghong
He Xingtang
Liu Hong
Xu Zhengyang
Publication venue: 'Elsevier BV'
Publication date: 15/03/2022

OPUS

pTSE: A Multi-model Ensemble Method for Probabilistic Time Series Forecasting

Author: Chu Zhixuan
Huang Yuchen
Jin Ge
Li Sheng
Ruan Yijia
Zhou Yunyi
Publication date: 16/05/2023

Various probabilistic time series forecasting models have sprung up and shown remarkably good performance. However, the choice of model relies heavily on the characteristics of the input time series and on the fixed distribution that the model is based on. Because probability distributions cannot be straightforwardly averaged across different models, current time series model ensemble methods cannot be directly applied to improve the robustness and accuracy of forecasting. To address this issue, we propose pTSE, a multi-model distribution ensemble method for probabilistic forecasting based on the Hidden Markov Model (HMM). pTSE only takes off-the-shelf outputs from member models, without requiring further information about each model. In addition, we provide a complete theoretical analysis of pTSE to prove that the empirical distribution of a time series subject to an HMM will converge to the stationary distribution almost surely. Experiments on benchmarks show the superiority of pTSE over all member models and competitive ensemble methods. Comment: The 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023)
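The convergence claim in the abstract above can be illustrated numerically for the hidden state chain alone. The sketch below is not pTSE: it only compares the stationary distribution of a small, made-up transition matrix with the state frequencies observed along one simulated path. The matrix values, function names, and step counts are assumptions chosen for illustration.

```python
import random

def stationary(P, iters=1000):
    """Approximate the stationary row vector pi (pi = pi P) by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def empirical(P, steps, seed=0):
    """Simulate the chain from state 0 and return the observed state frequencies."""
    rng = random.Random(seed)
    n = len(P)
    state = 0
    counts = [0] * n
    for _ in range(steps):
        counts[state] += 1
        state = rng.choices(range(n), weights=P[state])[0]
    return [c / steps for c in counts]

# Hypothetical two-state transition matrix (each row sums to 1).
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(P)
freq = empirical(P, steps=100_000)
```

For this matrix the stationary distribution is (5/6, 1/6), and the simulated frequencies should approach it as the number of steps grows.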

arXiv.org e-Print Archive

A Survey of Source Code Search: A 3-Dimensional Perspective

Author: Chen Yuchen
Chen Zhenyu
Fang Chunrong
Ge Xiuting
Ge Yifei
Hu Yuling
Liu Yang
Sun Weisong
Zhang Quanjun
Publication date: 13/11/2023

(Source) code search has attracted wide attention from software engineering researchers because it can improve the productivity and quality of software development. Given a functionality requirement, usually described in a natural language sentence, a code search system can retrieve code snippets that satisfy the requirement from a large-scale code corpus, e.g., GitHub. To realize effective and efficient code search, many techniques have been proposed in succession. These techniques improve code search performance mainly by optimizing three core components: the query understanding component, the code understanding component, and the query-code matching component. In this paper, we provide a 3-dimensional perspective survey for code search. Specifically, we categorize existing code search studies into query-end optimization techniques, code-end optimization techniques, and match-end optimization techniques according to the specific components they optimize. Considering that each end can be optimized independently and contributes to code search performance, we treat each end as a dimension. Therefore, this survey is 3-dimensional in nature, and it provides a comprehensive summary of each dimension in detail. To understand the research trends of the three dimensions in existing code search studies, we systematically review 68 relevant studies. Unlike existing code search surveys, which focus only on the query end or the code end, or which introduce various aspects shallowly (including codebase, evaluation metrics, modeling technique, etc.), our survey provides a more nuanced analysis and review of the evolution and development of the underlying techniques used in the three ends. Based on a systematic review and summary of existing work, we outline several open challenges and opportunities at the three ends that remain to be addressed in future work. Comment: submitted to ACM Transactions on Software Engineering and Methodology

arXiv.org e-Print Archive

A versatile route to fabricate single atom catalysts with high chemoselectivity

Author: Chen Hongyu
Deng Yuchen
Ge Binghui
He Qian
He Xiaohui
Ji Hongbing
Ma Ding
Peng Mi
Xiao Dequan
Yao Siyu
Zhang Mengtao
Zhang Ying
Publication venue: Digital Commons @ New Haven
Publication date: 01/01/2019

Preparation of single atom catalysts (SACs) is of broad interest to materials scientists and chemists but remains a formidable challenge. Herein, we develop an efficient approach to synthesize SACs via a precursor-dilution strategy, in which metalloporphyrins (MTPP) with target metals are co-polymerized with diluents (tetraphenylporphyrin, TPP), followed by pyrolysis to N-doped porous carbon supported SACs (M1/N-C). Twenty-four different SACs, including noble metals and non-noble metals, are successfully prepared. In addition, the synthesis of a series of catalysts with different surface atom densities, bi-metallic sites, and metal aggregation states is achieved. This approach shows remarkable adjustability and generality, providing sufficient freedom to design catalysts at the atomic scale and explore the unique catalytic properties of SACs. As an example, we show that the prepared Pt1/N-C exhibits superior chemoselectivity and regioselectivity in hydrogenation. It only converts terminal alkynes to alkenes while keeping other reducible functional groups, such as alkenyl groups, nitro groups, and even internal alkynes, intact.

Digital Commons @ New Haven

Major data analysis errors invalidate cancer microbiome findings

Author: Brewer Daniel S.
Cooper Colin S.
Ge Yuchen
Gihawi Abraham
Lu Jennifer
Pertea Mihaela
Puiu Daniela
Salzberg Steven L.
Xu Amanda
Publication date: 01/10/2023

We re-analyzed the data from a recent large-scale study that reported strong correlations between DNA signatures of microbial organisms and 33 different cancer types and that created machine-learning predictors with near-perfect accuracy at distinguishing among cancers. We found at least two fundamental flaws in the reported data and in the methods: (i) errors in the genome database and the associated computational methods led to millions of false-positive findings of bacterial reads across all samples, largely because most of the sequences identified as bacteria were instead human; and (ii) errors in the transformation of the raw data created an artificial signature, even for microbes with no reads detected, tagging each tumor type with a distinct signal that the machine-learning programs then used to create an apparently accurate classifier. Each of these problems invalidates the results, leading to the conclusion that the microbiome-based classifiers for identifying cancer presented in the study are entirely wrong. These flaws have subsequently affected more than a dozen additional published studies that used the same data and whose results are likely invalid as well.

Directory of Open Access Journals

University of East Anglia digital repository

Genetic Diversity and Linkage Disequilibrium in Chinese Bread Wheat (Triticum aestivum L.) Revealed by SSR Markers

Author: A Horvath
A Rafalski
AJ Garris
ATW Kraakman
BA Payseur
BS Gaut
Chenyang Hao
CY Hao
DJ Schoen
DJ Somers
DL Remington
E Pestsova
F Balfourier
F Breseghello
FJ Rohlf
FY Yeh
G Evanno
GX You
Hongmei Ge
J Wang
JB Yan
JK Pritchard
K Liu
KA Mather
L Excoffier
Lanfen Wang
LL Cavalli-Sforza
LV Malysheva-Otto
M Maccaferri
M Nei
MA Chapman
MS Röder
MT Hamblin
NA Rosenberg
PJ Sharp
PK Gupta
PW Hedrick
Pär Ingvarsson
QS Zhuang
RH Wang
RJ Petit
S Chao
S Dreisigacker
SA Flint-Garcia
SD Tanksley
SP Dickson
V Roussel
WC Knowler
X Perrier
XQ Huang
Xueyong Zhang
XY Zhang
YH Li
YL Zhu
YS Dong
Yuchen Dong
ZA Guo
ZW Zhang
Publication venue: Public Library of Science
Publication date: 01/01/2010

Two hundred and fifty bread wheat lines, mainly Chinese mini core accessions, were assayed for polymorphism and linkage disequilibrium (LD) based on 512 whole-genome microsatellite loci representing a mean marker density of 5.1 cM. A total of 6,724 alleles, ranging from 1 to 49 per locus, were identified in all collections. The mean PIC value was 0.650, ranging from 0 to 0.965. Population structure and principal coordinate analysis revealed that landraces and modern varieties were two relatively independent genetic sub-groups. Landraces had a higher allelic diversity than modern varieties with respect to both genomes and chromosomes in terms of total number of alleles and allelic richness. A total of 3,833 (57.0%) and 2,788 (41.5%) rare alleles with frequencies of <5% were found in the landrace and modern variety gene pools, respectively, indicating greater numbers of rare variants, or likely new alleles, in landraces. Analysis of molecular variance (AMOVA) showed that the A genome had the largest genetic differentiation and the D genome the lowest. In contrast to genetic diversity, modern varieties displayed a wider average LD decay across the whole genome for locus pairs with r²>0.05 (P<0.001) than the landraces. Mean LD decay distance for the landraces at the whole genome level was <5 cM, while modern varieties showed a higher LD decay distance of 5–10 cM. LD decay distances were also somewhat different for each of the 21 chromosomes, being higher for most of the chromosomes in modern varieties (<5∼25 cM) compared to landraces (<5∼15 cM), presumably indicating the influences of domestication and breeding. This study facilitates predicting the marker density required to effectively associate genotypes with traits in Chinese wheat genetic resources.
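The PIC (polymorphism information content) values quoted in the abstract above can be made concrete with a short calculation. The sketch below assumes the standard definition, PIC = 1 − Σ pᵢ² − Σ_{i<j} 2 pᵢ² pⱼ², where pᵢ is the frequency of allele i at a locus; the abstract does not say which formula or software the authors used, and the function name and input format here are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pic(alleles):
    """Polymorphism information content for one locus.

    `alleles` is a list of allele labels observed at the locus, one per line
    (assumption: each line contributes a single allele call).
    """
    counts = Counter(alleles)
    total = sum(counts.values())
    freqs = [c / total for c in counts.values()]
    # Sum of squared allele frequencies (expected homozygosity).
    homozygosity = sum(p * p for p in freqs)
    # Correction term over every unordered pair of distinct alleles.
    correction = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - correction
```

Under this definition a monomorphic locus gives a PIC of 0, consistent with the lower end of the reported range, and the value approaches 1 only when many alleles occur at similar frequencies.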

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Identification and Characterization of a Leucine-Rich Repeat Kinase 2 (LRRK2) Consensus Phosphorylation Motif

Author: A Garcia-Regalado
A Kumar
A Quan
A Sasaki
A Zimprich
AB West
AE Karnoub
Andrew B. West
B Roger
BE Turk
BH Lower
BI Giasson
Brian Bates
C Henchcliffe
C Paisan-Ruiz
C Sanchez
C Tu
CC Ho
CJ Gloeckner
CL Klein
D MacLeod
DJ Moore
DQ Yang
E Greggio
ED Plowey
EM Schaefer
ES Roach
Eugene L. Brown
EY Chan
FC Dorsey
G Ito
GE Crooks
IF Mata
J Alegre-Abarrategui
J Deng
J Lotharius
JB Schulz
JC Obenauer
JE Hutti
JJ Hill
JP McGrath
K Fujii
K Richter
KE Paleologou
Kerri Lipinski
L Dehmelt
L Guo
L Mishra
L Parisiadou
M Jaleel
M Liu
M Trahey
Mel B. Feany
N Dephoure
N Shin
O Bembom
P Yu
PA Jaeger
PA Lewis
Peter H. Reinhart
Pooja P. Pungaliya
Q Huang
RJ Nichols
RT Abraham
S Biskup
S Kamikawaji
S Sen
Saurabh Sen
Steven P. Braithwaite
T Hunter
T O'Neill
TD Schneider
Vasanti S. Anand
VS Anand
Warren D. Hirst
WW Smith
X Li
X Lin
Y Imai
Y Xiong
Yuchen Bai
Z Songyang
Publication venue: Public Library of Science
Publication date: 01/10/2010

Mutations in LRRK2 (leucine-rich repeat kinase 2) have been identified as major genetic determinants of Parkinson's disease (PD). The most prevalent mutation, G2019S, increases LRRK2's kinase activity; understanding the sites and substrates that LRRK2 phosphorylates is therefore critical to understanding its role in disease aetiology. Since the physiological substrates of this kinase are unknown, we set out to reveal potential targets of LRRK2 G2019S by identifying its favored phosphorylation motif. A non-biased screen of an oriented peptide library elucidated F/Y-x-T-x-R/K as the core dependent substrate sequence. Bioinformatic analysis of the consensus phosphorylation motif identified several novel candidate substrates that potentially function in neuronal pathophysiology. Peptides corresponding to the most PD-relevant proteins were efficiently phosphorylated by LRRK2 in vitro. Interestingly, the phosphomotif was also identified within LRRK2 itself. Autophosphorylation was detected by mass spectrometry and biochemical means at the only F-x-T-x-R site (Thr 1410) within LRRK2. The relevance of this site was assessed by measuring effects of mutations on autophosphorylation, kinase activity, GTP binding, GTP hydrolysis, and LRRK2 multimerization. These studies indicate that modification of Thr1410 subtly regulates GTP hydrolysis by LRRK2, but with minimal effects on the other parameters measured. Together, the identification of LRRK2's phosphorylation consensus motif and the functional consequences of its phosphorylation provide insights into downstream LRRK2-signaling pathways.
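A consensus motif such as F/Y-x-T-x-R/K can be searched for in a protein sequence with a regular expression. The sketch below is a minimal illustration of that kind of scan, not the bioinformatic analysis used in the paper; the function name and the example sequence are hypothetical, and a lookahead is used so that overlapping matches are reported.

```python
import re

# F or Y, any residue, T (the phospho-acceptor), any residue, then R or K.
# The lookahead lets overlapping occurrences be found.
MOTIF = re.compile(r"(?=([FY].T.[RK]))")

def find_motif_sites(protein):
    """Return (1-based position of the Thr, matched 5-mer) for each motif hit."""
    sites = []
    for match in MOTIF.finditer(protein.upper()):
        # The Thr is the third residue of the motif, so 1-based position = start + 3.
        sites.append((match.start() + 3, match.group(1)))
    return sites

# Hypothetical toy sequence containing two overlapping motif instances.
hits = find_motif_sites("FYTTRK")
```

Restricting the first and last character classes (for example `F.T.R`) would give the narrower F-x-T-x-R pattern mentioned for the autophosphorylation site.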

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central